04-Multinomial-Regression (Score: 105.0 / 105.0)

  1. Test cell (Score: 5.0 / 5.0)
  2. Test cell (Score: 10.0 / 10.0)
  3. Test cell (Score: 5.0 / 5.0)
  4. Test cell (Score: 5.0 / 5.0)
  5. Test cell (Score: 5.0 / 5.0)
  6. Test cell (Score: 5.0 / 5.0)
  7. Test cell (Score: 5.0 / 5.0)
  8. Test cell (Score: 5.0 / 5.0)
  9. Coding free-response (Score: 20.0 / 20.0)
  10. Task (Score: 40.0 / 40.0)
In [1]:
NAME = "Abhinav Asheesh Lugun"
ID = "122322"

Machine Learning Lab 04: Multinomial Logistic Regression

Generalized Linear Models

From lecture, we know that members of the exponential family distributions can be written in the form $$p(y;\eta) = b(y)e^{(\eta^\top T(y)-a(\eta))},$$ where

  • $\eta$ is the natural parameter or canonical parameter of the distribution,
  • $T(y)$ is the sufficient statistic (we normally use $T(y) = y$),
  • $b(y)$ is an arbitrary scalar function of y, and
  • $a(\eta)$ is the log partition function. The factor $e^{-a(\eta)}$ just normalizes the distribution so that it sums or integrates to 1.

Each choice of $T$, $a$, and $b$ defines a family (set) of distributions parameterized by $\eta$.

If we can write $p(y \mid \mathbf{x} ; \theta)$ as a member of the exponential family of distributions with parameters $\mathbf{\eta}$ with $\eta_i = \theta^\top_i \mathbf{x}$, we obtain a generalized linear model that can be optimized using the maximum likelihood principle.

The GLM for the Gaussian distribution with natural parameter $\eta$ being the mean of the Gaussian gives us ordinary linear regression.

The Bernoulli distribution with parameter $\phi$ can be written as an exponential family distribution with natural parameter $\eta = \log \frac{\phi}{1-\phi}$. The GLM for this distribution is logistic regression.
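As a quick check of that claim, write out the Bernoulli pmf and collect terms in the exponent: $$\begin{eqnarray} p(y;\phi) & = & \phi^y (1-\phi)^{1-y} \\ & = & e^{y \log \phi + (1-y)\log(1-\phi)} \\ & = & e^{\left(\log \frac{\phi}{1-\phi}\right) y + \log(1-\phi)}, \end{eqnarray}$$ which matches the exponential family form with $T(y) = y$, $b(y) = 1$, $\eta = \log \frac{\phi}{1-\phi}$, and $a(\eta) = -\log(1-\phi) = \log(1 + e^{\eta})$.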

When we write the multinomial distribution with parameters $\phi_i > 0$ for classes $i \in 1..K$ with the constraint that $$\sum_{i=1}^{K} \phi_i = 1$$ as a member of the exponential family, the resulting GLM is called multinomial logistic regression. The parameters $\phi_1, \ldots, \phi_K$ are written in terms of $\theta$ as $$\phi_i = p(y = i \mid \mathbf{x}; \theta) = \frac{e^{\theta^\top_i \mathbf{x}}}{\sum_{j=1}^{K}e^{\theta^\top_j \mathbf{x}}}. $$
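The softmax expression above is easy to sketch in NumPy. In this illustrative snippet (the names `softmax_probs`, `Theta`, and the sample numbers are ours, not part of the lab code), each row of `Theta` plays the role of one $\theta_i$:

```python
import numpy as np

def softmax_probs(Theta, x):
    """Return the vector (phi_1, ..., phi_K) for a single input x."""
    a = Theta @ x                # linear activations theta_i^T x, one per class
    e = np.exp(a - a.max())      # shift by max(a); probabilities are unchanged
    return e / e.sum()

Theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])   # K = 3 classes, n = 2 features
x = np.array([1.0, 2.0])
p = softmax_probs(Theta, x)      # a length-3 vector summing to 1
```

Subtracting the maximum activation before exponentiating is a standard trick to avoid overflow and does not change the resulting probabilities.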

Optimizing a Multinomial Regression Model

In multinomial regression, we have

  1. Data are pairs $\mathbf{x}^{(i)}, y^{(i)}$ with $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in 1..K$.

  2. The hypothesis is a vector-valued function $$\mathbf{h}_\theta(\mathbf{x}) = \begin{bmatrix} p(y = 1 \mid \mathbf{x} ; \theta) \\ p(y = 2 \mid \mathbf{x} ; \theta) \\ \vdots \\ p(y = K \mid \mathbf{x} ; \theta) \end{bmatrix},$$
    

    where $$\phi_i = p(y = i \mid \mathbf{x}; \theta) = \frac{e^{\theta^\top_i \mathbf{x}}}{\sum_{j=1}^{K}e^{\theta^\top_j \mathbf{x}}}. $$

We need a cost function and a way to minimize that cost function. As usual, we try to find the parameters maximizing the likelihood or log likelihood function, or equivalently, minimizing the negative log likelihood function:

$$\theta^* = \text{argmax}_\theta {\cal L}(\theta) = \text{argmax}_\theta \ell(\theta) = \text{argmin}_\theta J(\theta),$$

where $$\begin{eqnarray} J(\theta) & = & - \ell(\theta) \\ & = & - \sum_{i=1}^m \log p(y^{(i)} \mid \textbf{x}^{(i)} ; \theta). \end{eqnarray}$$

Now that we know what $J(\theta)$ is, let's try to find its minimum by taking the derivative with respect to an arbitrary parameter $\theta_{kl}$, the $l$-th element of the parameter vector $\theta_k$ for class $k$. Before we start, let's define a variable $a_k$ as the linear activation for class $k$ in the softmax function: $$ a_k = \theta_k^\top \mathbf{x}^{(i)}, $$ and rewrite the softmax more conveniently as $$ \phi_k = \frac{e^{a_k}}{\sum_{j=1}^K e^{a_j}}. $$ That makes it a little easier to compute the gradient: $$\begin{eqnarray} \frac{\partial J}{\partial \theta_{kl}} & = & - \sum_{i=1}^m \frac{1}{\phi_{y^{(i)}}} \frac{\partial \phi_{y^{(i)}}}{\partial \theta_{kl}}. \\ \end{eqnarray}$$ Using the chain rule, we have $$\frac{\partial \phi_{y^{(i)}}}{\partial \theta_{kl}} = \sum_{j=1}^K \frac{\partial \phi_{y^{(i)}}}{\partial a_j} \frac{\partial a_j}{\partial \theta_{kl}}.$$ The second factor is easy: $$ \frac{\partial a_j}{\partial \theta_{kl}} = \delta(k=j)x^{(i)}_l. $$ For the first factor, the quotient rule gives $$\begin{eqnarray} \frac{\partial \phi_{y^{(i)}}}{\partial a_j} & = & \frac{ \left[ \delta(y^{(i)}=j)e^{a_{y^{(i)}}} \sum_{c=1}^K e^{a_c} \right] - e^{a_{y^{(i)}}} e^{a_j} }{\left[ \sum_{c=1}^K e^{a_c} \right]^2} \\ & = & \delta(y^{(i)}=j) \phi_{y^{(i)}} - \phi_{y^{(i)}} \phi_j \\ & = & \phi_{y^{(i)}} \left( \delta(y^{(i)}=j) - \phi_j \right). \end{eqnarray}$$ The factor $\phi_{y^{(i)}}$ conveniently cancels the $\frac{1}{\phi_{y^{(i)}}}$ in the expression for $\frac{\partial J}{\partial \theta_{kl}}$ above.

Substituting what we've derived into the definition above, we obtain $$ \frac{\partial J}{\partial \theta_{kl}} = - \sum_{i=1}^m \sum_{j=1}^K (\delta(y^{(i)}=j) - \phi_j) \frac{\partial a_j}{\partial \theta_{kl}}. $$

There are two ways to do the calculation. In deep neural networks with multinomial outputs, we want to first calculate the $\frac{\partial J}{\partial a_j}$ terms then use them to calculate $\frac{\partial J}{\partial \theta_{kl}}$.

However, if we only have the "single layer" model described up till now, we note that $$\frac{\partial a_j}{\partial \theta_{kl}} = \delta(j=k) x^{(i)}_l,$$ so we can simplify as follows: $$\begin{eqnarray} \frac{\partial J}{\partial \theta_{kl}} & = & - \sum_{i=1}^m \sum_{j=1}^K (\delta(y^{(i)}=j) - \phi_j) \frac{\partial a_j}{\partial \theta_{kl}} \\ & = & - \sum_{i=1}^m \sum_{j=1}^K (\delta(y^{(i)}=j) - \phi_j) \delta(j=k) x^{(i)}_l \\ & = & - \sum_{i=1}^m (\delta(y^{(i)}=k) - \phi_k) x^{(i)}_l \\ \end{eqnarray}$$
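A good way to convince yourself of the last line is to compare it against a numerical finite-difference gradient of the negative log likelihood. The sketch below does exactly that on random data; all names here (`probs`, `nll`, `grad_nll`) are illustrative, not part of the lab's API:

```python
import numpy as np

def probs(Theta, X):
    A = X @ Theta.T                                  # (m, K) activations
    E = np.exp(A - A.max(axis=1, keepdims=True))     # stable softmax
    return E / E.sum(axis=1, keepdims=True)

def nll(Theta, X, y):
    P = probs(Theta, X)
    return -np.log(P[np.arange(len(y)), y]).sum()    # J(theta)

def grad_nll(Theta, X, y):
    P = probs(Theta, X)                              # phi_k for every example
    Y = np.zeros_like(P)
    Y[np.arange(len(y)), y] = 1.0                    # one-hot delta(y_i = k)
    return -(Y - P).T @ X                            # (K, n), the derived formula

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 4, size=20)
Theta = rng.normal(size=(4, 3))

analytic = grad_nll(Theta, X, y)
numeric = np.zeros_like(Theta)
eps = 1e-6
for k in range(4):
    for l in range(3):
        Tp, Tm = Theta.copy(), Theta.copy()
        Tp[k, l] += eps
        Tm[k, l] -= eps
        numeric[k, l] = (nll(Tp, X, y) - nll(Tm, X, y)) / (2 * eps)

max_err = np.abs(analytic - numeric).max()           # should be tiny
```

If the derivation is right, `max_err` is on the order of floating-point noise; a large value would indicate a sign or indexing mistake.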

Put It Together

OK! Now we have all four components of our multinomial regression model:

  1. Data are pairs $\mathbf{x}^{(i)}, y^{(i)}$ with $\mathbf{x}^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in 1..K$.

  2. The hypothesis is a vector-valued function $$\mathbf{h}_\theta(\mathbf{x}) = \begin{bmatrix} p(y = 1 \mid \mathbf{x} ; \theta) \\ p(y = 2 \mid \mathbf{x} ; \theta) \\ \vdots \\ p(y = K \mid \mathbf{x} ; \theta) \end{bmatrix},$$
    

    where $$\phi_i = p(y = i \mid \mathbf{x}; \theta) = \frac{e^{\theta^\top_i \mathbf{x}}}{\sum_{j=1}^{K}e^{\theta^\top_j \mathbf{x}}}. $$

  3. The cost function is $$J(\theta) = - \sum_{i=1}^m \log p(y^{(i)} \mid \textbf{x}^{(i)} ; \theta)$$

  4. The optimization algorithm is gradient descent on $J(\theta)$ with the update rule $$\theta_{kl}^{(n+1)} \leftarrow \theta_{kl}^{(n)} + \alpha \sum_{i=1}^m (\delta(y^{(i)}=k) - \phi_k) x^{(i)}_l.$$ (Note the plus sign: the gradient itself is $- \sum_{i=1}^m (\delta(y^{(i)}=k) - \phi_k) x^{(i)}_l$, and the update steps in the opposite direction.)
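Putting the four components together, here is a minimal end-to-end sketch that fits the model to toy two-class data by stepping opposite the gradient $\partial J / \partial \theta_{kl}$. The synthetic data, the division by $m$ (a common mean-gradient convention), and all names are illustrative choices, not part of the lab's code:

```python
import numpy as np

rng = np.random.default_rng(1)
# 60 examples: an intercept column plus one feature
X = np.hstack([np.ones((60, 1)), rng.normal(size=(60, 1))])
y = (X[:, 1] > 0).astype(int)        # 2 linearly separable classes

Theta = np.zeros((2, 2))             # one parameter row per class
alpha = 0.1
for _ in range(300):
    A = X @ Theta.T                  # activations a_k for every example
    P = np.exp(A - A.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)            # softmax probabilities phi_k
    Y = np.zeros_like(P)
    Y[np.arange(len(y)), y] = 1.0                # one-hot delta(y_i = k)
    Theta += alpha * (Y - P).T @ X / len(y)      # step opposite dJ/dtheta

train_acc = (np.argmax(X @ Theta.T, axis=1) == y).mean()
```

Because the classes are separable by construction, a few hundred iterations should be enough for training accuracy close to 1.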

Multinomial Regression Example

The following example of multinomial logistic regression is from Kaggle.

The data set is the famous Iris dataset from the UCI machine learning repository.

The data contain 50 samples from each of three classes. Each class refers to a particular species of the iris plant. The data include four independent variables:

  1. Sepal length in cm
  2. Sepal width in cm
  3. Petal length in cm
  4. Petal width in cm

The target takes on one of three classes:

  1. Iris Setosa
  2. Iris Versicolour
  3. Iris Virginica

To predict the target value, we use multinomial logistic regression with $K=3$ classes, i.e., $y \in \{ 1, 2, 3 \}$.

Given $\mathbf{x}$, we would like to predict a probability distribution over the three outcomes for $y$, i.e., $\phi_1 = p(y=1 \mid \mathbf{x})$, $\phi_2 = p(y=2 \mid \mathbf{x})$, and $\phi_3 = p(y=3 \mid \mathbf{x})$.

In [2]:
# importing libraries
import numpy as np
import pandas as pd 
import random
import math

The phi function returns $\phi_i$ for input patterns $\mathtt{X}$ and parameters $\theta$.

In [3]:
def phi(i, theta, X, num_class):
    """
    Here is how to make documentation for your function show up in intellisense.
    Explanation you put here will be shown when you use it.
    
    To get intellisense in your Jupyter notebook:
        - Press 'TAB' after typing a dot (.) to see methods and attributes
        - Press 'Shift+TAB' after typing a function name to see its documentation

    The `phi` function returns phi_i = h_theta(x) for input patterns X and parameters theta.
    
    Inputs:
        i: index of the class whose probability phi_i is returned

        theta: parameter matrix (one row of parameters per class)

        X: a single input pattern

        num_class: number of classes K

    Returns:
        phi_i
    """
    mat_theta = np.matrix(theta[i])
    mat_x = np.matrix(X)
    num = math.exp(np.dot(mat_theta, mat_x.T))
    den = 0
    for j in range(0,num_class):
        mat_theta_j = np.matrix(theta[j])
        den = den + math.exp(np.dot(mat_theta_j, mat_x.T))
    phi_i = num / den
    return phi_i

Tips for using intellisense: Shift+TAB

lab4-01.png

The grad_cost function gives the gradient of the cost for data $\mathtt{X}, \mathbf{y}$ for class $j\in 1..k$.

In [4]:
def indicator(i, j):
    '''
    Check whether i is equal to j
    
    Return:
        1 when i=j, otherwise 0
    '''
    if i == j: return 1
    else: return 0


def grad_cost(X, y, j, theta, num_class):
    '''
    Compute the gradient of the cost function for data X, y for parameters of
    output for class j in 1..k
    '''
    m, n = X.shape
    sum = np.array([0 for i in range(0,n)])
    for i in range(0, m):
        p = indicator(y[i], j) - phi(j, theta, X.loc[i], num_class)
        sum = sum + (X.loc[i] * p)
    grad = -sum / m
    return grad

def gradient_descent(X, y, theta, alpha, iters, num_class):
    '''
    Perform iters iterations of gradient descent: theta_new = theta_old - alpha * grad
    '''
    n = X.shape[1]
    for iter in range(iters):
        dtheta = np.zeros((num_class, n))
        for j in range(0, num_class):
            dtheta[j,:] = grad_cost(X, y, j, theta, num_class)
        theta = theta - alpha * dtheta
    return theta

def h(X, theta, num_class):
    '''
    Hypothesis function: h_theta(X) = softmax(theta X), the vector of class probabilities
    '''
    X = np.matrix(X)
    h_matrix = np.empty((num_class,1))
    den = 0
    for j in range(0, num_class):
        den = den + math.exp(np.dot(theta[j], X.T))
    for i in range(0,num_class):
        h_matrix[i] = math.exp(np.dot(theta[i], X.T))
    h_matrix = h_matrix / den
    return h_matrix
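One caveat about the phi and h functions above: math.exp(np.dot(theta_j, X.T)) overflows once an activation exceeds roughly 709. A common remedy, shown in this hedged sketch (an alternative formulation, not part of the graded API), is to subtract the maximum activation before exponentiating, which leaves the probabilities unchanged:

```python
import numpy as np

def stable_h(x, Theta):
    """Softmax class probabilities with the max-activation shift."""
    a = Theta @ np.asarray(x, dtype=float)   # activations theta_j^T x
    e = np.exp(a - a.max())                  # same probabilities, no overflow
    return e / e.sum()

# A large feature value that would overflow math.exp without the shift
p = stable_h([1.0, 500.0], np.array([[0.1, 1.0],
                                     [0.2, 1.1]]))
```

For the Iris features (all under 8 cm) overflow never occurs, but the stable form costs nothing and is the safe default.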

Exercise 1.1 (5 points)

Create a function to load data from Iris.csv using the Pandas library and extract y from the data.

You can use the Pandas 10 minute guide to learn how to use pandas.

In [5]:
Student's answer(Top)
def load_data(file_name, drop_label, y_label, is_print=False):
    # 1. Load csv file
    data = pd.read_csv(file_name)
    if is_print:
        print(data.head())
    # 2. remove 'Id' column from data
    if drop_label is not None:
        data = data.drop([drop_label],axis=1)
        if is_print:
            print(data.head())
    # 3. Extract y_label column as y from data
    y = None
    # 4. get index of y-column
    y_index = data.columns.get_loc(y_label)
    # 5. Extract X features from data
    X = None
    # YOUR CODE HERE
    X = data.iloc[:, 0:y_index]
    y = data.iloc[:, y_index]
    
    return X, y
In [6]:
Grade cell: cell-7b5c0f18770fbccb Score: 5.0 / 5.0 (Top)
X, y = load_data('Iris.csv', 'Id', 'Species', True)
print(X.head())
print(y[:5])

# Test function: Do not remove
# tips: this is how to create dataset using pandas
d_ex = {'ID':     [  1,   2,   3,    4,    5,    6,    7],
        'Grade':  [3.5, 2.5, 3.0, 3.75, 2.83, 3.95, 2.68],
        'Type':   ['A', 'B', 'C',  'A',  'C',  'A',  'B']
        }
df = pd.DataFrame (d_ex, columns = ['ID','Grade', 'Type'])
df.to_csv('out.csv', index=False)

Xtest, ytest = load_data('out.csv', 'ID', 'Type')
assert len(Xtest.columns) == 1, 'number of X_columns incorrect (1)'
assert ytest.name == 'Type', 'Extract y_column is incorrect (1)'
assert ytest.shape == (7,), 'number of y is incorrect (1)'
assert 'Grade' in Xtest.columns, 'Incorrect columns in X (1)'
Xtest, ytest = load_data('out.csv', None, 'Type')
assert len(Xtest.columns) == 2, 'number of X_columns incorrect (2)'
assert ytest.name == 'Type', 'Extract y_column is incorrect (2)'
assert ytest.shape == (7,), 'number of y is incorrect (2)'
assert 'Grade' in Xtest.columns and 'ID' in Xtest.columns, 'Incorrect columns in X (2)'
import os
os.remove('out.csv')

assert len(X.columns) == 4, 'number of X_columns incorrect (3)'
assert 'SepalWidthCm' in X.columns and 'Id' not in X.columns and 'Species' not in X.columns, 'Incorrect columns in X (3)'
assert y.name == 'Species', 'Extract y_column is incorrect (3)'
assert y.shape == (150,), 'number of y is incorrect (3)'

print("success!")
# End Test function
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa
   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
0            5.1           3.5            1.4           0.2
1            4.9           3.0            1.4           0.2
2            4.7           3.2            1.3           0.2
3            4.6           3.1            1.5           0.2
4            5.0           3.6            1.4           0.2
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object
success!

Expected result:

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
0            5.1           3.5            1.4           0.2
1            4.9           3.0            1.4           0.2
2            4.7           3.2            1.3           0.2
3            4.6           3.1            1.5           0.2
4            5.0           3.6            1.4           0.2
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Species, dtype: object

Exercise 1.2 (10 points)

Partition data into training and test sets

  • No need to use random.seed function!
  • Ensure that the train set is 70% and the test set is 30% of the data.
  • Encode the labels in the y attribute to be integers in the range 0..k-1.
Hint: You can use the partition function from lab02 if you like

pandas.iloc must be used to extract data from an index list

pandas.unique will give you the set of unique labels
In [7]:
Student's answer(Top)
def encode (y, y_labels_name):
    for i in range (len(y)):
        if y[i] == y_labels_name[0]:
            y[i] = 0
        elif y[i] == y_labels_name[1]:
            y[i] = 1
        elif y[i] == y_labels_name[2]:
            y[i] = 2
    return y

def partition(X, y, percent_train):
    # 1. create index list
    # 2. shuffle index
    # 3. Create train/test index
    # 4. Separate X_Train, y_train, X_test, y_test
    # 5. Get y_labels_name from y using pandas.unique function
    # 6. Change y_labels_name into string number and put into y_labels_new
    # 7. Drop shuffle index columns
    #     - pandas.reset_index() and pandas.drop(...) might be help
    
    y_labels_name = None
    y_labels_new = None
    
    # YOUR CODE HERE
    idx = np.arange(0,y.shape[0])
    random.shuffle(idx)
    partition_index = int(len(idx) * percent_train)
    
    train_idx = idx[:partition_index]
    test_idx = idx[partition_index:]
    
    X_train = X.iloc[train_idx].reset_index(drop=True)
    y_train = y.iloc[train_idx].reset_index(drop=True).copy()
    
    X_test = X.iloc[test_idx].reset_index(drop=True)
    y_test = y.iloc[test_idx].reset_index(drop=True).copy()
    
    y_labels_name = pd.unique(y)
    y_labels_new = np.array([0,1,2])
    
    y_train = encode(y_train, y_labels_name)
    y_test = encode(y_test, y_labels_name)
    
    return idx, X_train, y_train, X_test, y_test, y_labels_name, y_labels_new
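The encode function above hard-codes three classes. As a hedged alternative (an illustrative sketch, not the graded API), a dictionary built from the label names handles any number of classes in one pass:

```python
import pandas as pd

def encode_any(y, labels):
    """Map label names to integers 0..k-1 in the order given by `labels`."""
    mapping = {name: i for i, name in enumerate(labels)}
    return y.map(mapping)

# pd.unique preserves order of first appearance: cat -> 0, dog -> 1, bird -> 2
y_demo = pd.Series(['cat', 'dog', 'cat', 'bird'])
encoded = encode_any(y_demo, pd.unique(y_demo))
```

Series.map also avoids mutating the input in place, which the loop-based encode does.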
In [8]:
Grade cell: cell-57199b8ae505edd8 Score: 10.0 / 10.0 (Top)
percent_train = 0.7
idx, X_train, y_train, X_test, y_test, y_labels_name, y_labels_new = partition(X, y, percent_train)
print('X_train.shape', X_train.shape)
print('X_test.shape', X_test.shape)
print('y_train.shape', y_train.shape)
print('y_test.shape', y_test.shape)
print('y_labels_name: ', y_labels_name)
print('y_labels_new: ', y_labels_new)
print(X_train.head())
print(y_train.head())

# Test function: Do not remove
assert len(y_labels_name) == 3 and len(y_labels_new) == 3, 'number of y uniques are incorrect'
assert X_train.shape == (105, 4), 'Size of X_train is incorrect'
assert X_test.shape == (45, 4), 'Size of x_test is incorrect'
assert y_train.shape == (105, ), 'Size of y_train is incorrect'
assert y_test.shape == (45, ), 'Size of y_test is incorrect'
assert 'Iris-setosa' in y_labels_name and 'Iris-virginica' in y_labels_name and \
        'Iris-versicolor' in y_labels_name, 'y unique data incorrect'
assert min(y_labels_new) == 0 and max(y_labels_new) < 3, 'label indices are incorrect'

print("success!")
# End Test function
X_train.shape (105, 4)
X_test.shape (45, 4)
y_train.shape (105,)
y_test.shape (45,)
y_labels_name:  ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
y_labels_new:  [0 1 2]
   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
0            5.5           3.5            1.3           0.2
1            7.9           3.8            6.4           2.0
2            4.8           3.4            1.9           0.2
3            7.4           2.8            6.1           1.9
4            6.4           2.9            4.3           1.3
0    0
1    2
2    0
3    2
4    1
Name: Species, dtype: object
success!

Expected result: (*or similar*)

X_train.shape (105, 4)
X_test.shape (45, 4)
y_train.shape (105,)
y_test.shape (45,)
y_labels_name:  ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
y_labels_new:  [0, 1, 2]

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
0            6.4           2.8            5.6           2.2
1            6.7           3.3            5.7           2.1
2            4.6           3.4            1.4           0.3
3            5.1           3.8            1.5           0.3
4            5.0           2.3            3.3           1.0
Species
0    2
1    2
2    0
3    0
4    1

Exercise 1.3 (5 points)

Train your classification model using the gradient_descent function already provided. You might also play around with the gradient descent function to see if you can speed it up!

In [9]:
Student's answer(Top)
# num_class is the number of unique labels
num_class = len(y_labels_name)

if (X_train.shape[1] == X.shape[1]): 
    X_train.insert(0, "intercept", 1)

# Reset m and n for training data
r, c = X_train.shape

# Initialize theta for each class
theta_initial = np.ones((num_class, c))

alpha = .05
iterations = 200

theta = None
# Logistic regression
# YOUR CODE HERE
theta = gradient_descent(X_train, y_train, theta_initial, alpha, iterations, num_class)
In [10]:
Grade cell: cell-4df1fd593dc59103 Score: 5.0 / 5.0 (Top)
print(theta)
print(theta.shape)

# Test function: Do not remove
assert theta.shape == (3, 5), 'Size of theta is incorrect'

print("success!")
# End Test function
[[ 1.17064494  1.31280313  1.83847418 -0.18875576  0.44316451]
 [ 1.08754009  1.14175888  0.7629421   1.20272632  0.85063026]
 [ 0.74181498  0.54543799  0.39858372  1.98602944  1.70620524]]
(3, 5)
success!

Expected result: (*or similar*)

[[ 1.17632192  1.32360047  1.83204165 -0.20224445  0.44039155]
 [ 1.10140069  1.13537321  0.74833178  1.21907866  0.82567377]
 [ 0.72227738  0.54102632  0.41962657  1.98316579  1.73393467]]

(3, 5)

Exercise 1.4 (5 points)

Let's get your model to make predictions on the test data.

In [11]:
Student's answer(Top)
# Prediction on test data 

if (X_test.shape[1] == X.shape[1]): 
    X_test.insert(0, "intercept", 1)

# Reset m and n for test data
r,c = X_test.shape

y_pred = []
for index,row in X_test.iterrows(): # get a row of X_test data
    # calculate y_hat using hypothesis function
    y_hat = None
    # find the index (integer value) of maximum value in y_hat and input back to prediction
    prediction = None
    # YOUR CODE HERE
    y_hat = h(row, theta, num_class)
    prediction = int(np.argmax(y_hat))
    
    # collect the result
    y_pred.append(prediction)
In [12]:
Grade cell: cell-eb837dfbea00f88e Score: 5.0 / 5.0 (Top)
print(len(y_pred))
print(y_pred[:7])
print(type(y_pred[0]))

# Test function: Do not remove
assert len(y_pred) == 45, 'Size of y_pred is incorrect'
assert isinstance(y_pred[0], int) and isinstance(y_pred[15], int) and isinstance(y_pred[17], int), 'prediction type is incorrect'
assert max(y_pred) < 3 and min(y_pred) >= 0, 'wrong index of y_pred'

print("success!")
# End Test function
45
[2, 0, 1, 1, 0, 1, 2]
<class 'int'>
success!

Expected result: (*or similar*)

45
[2, 0, 2, 0, 0, 0, 2]

<class 'int'>

Exercise 1.5 (5 points)

Estimate accuracy of model on test data

$$\text{accuracy} = \frac{\text{number of correct test predictions}}{m_{\text{test}}}$$
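The accuracy formula above is a one-liner with NumPy, since a boolean comparison averages to the fraction of True entries (the arrays here are made-up examples):

```python
import numpy as np

y_true = np.array([0, 1, 2, 2, 1])
y_hat  = np.array([0, 1, 2, 1, 1])
accuracy = np.mean(y_true == y_hat)   # 4 correct out of 5
```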
In [13]:
Student's answer(Top)
def calc_accuracy(y_test, y_pred):
    accuracy = None
    # YOUR CODE HERE
    num_cor_pred_bool = y_test == y_pred
    num_cor_pred = 0
    for boolean in num_cor_pred_bool:
        if boolean:
            num_cor_pred += 1
    accuracy = num_cor_pred / len(y_test)
    return accuracy
In [14]:
Grade cell: cell-3c242c8ceb36bdde Score: 5.0 / 5.0 (Top)
accuracy = calc_accuracy(y_test, y_pred)
print('Accuracy: %.4f' % accuracy)

# Test function: Do not remove
assert isinstance(accuracy, float), 'accuracy should be floating point'
assert accuracy >= 0.8, 'Did you train the data?'

print("success!")
# End Test function
Accuracy: 0.9778
success!

Expected result: should be at least 0.8!

On your own in lab

We will do the following in lab:

  1. Write a function to obtain the cost for particular $\mathtt{X}$, $\mathbf{y}$, and $\theta$.
  2. Plot the training set and test cost as training goes on and find the best value for the number of iterations and learning rate.
  3. Make 2D scatter plots showing the predicted and actual class of each item in the training set, plotting two features at a time. Comment on the cause of the errors you observe. If you obtain perfect test set accuracy, re-run the train/test split and rerun the optimization until you observe some mistaken predictions on the test set.

Exercise 2.1 (15 points)

  1. Write a function to obtain the cost for particular $\mathtt{X}$, $\mathbf{y}$, and $\theta$. Name your function my_J() and implement
$$ J_j = -\delta(y = j)\log{\phi_j} $$
In [15]:
Student's answer(Top)
def my_J(theta, X, y, j, num_class):
    cost = None
    # YOUR CODE HERE
    cost = - indicator(y,j) * math.log(phi(j, theta, X, num_class))
    return cost
In [16]:
Grade cell: cell-fbcbb4e3e0175f5b Score: 5.0 / 5.0 (Top)
# Test function: Do not remove
m, n = X_train.shape
test_theta = np.ones((3, n))  
cost = my_J(test_theta, X_train.loc[10], y_train[10], 0, 3)
assert isinstance(cost, float), 'cost should be floating point'

print("success!")
# End Test function
success!
  2. Implement my_grad_cost using your my_J function
In [17]:
Student's answer(Top)
def my_grad_cost(X, y, j, theta, num_class):
    grad = None
    cost = None
    # YOUR CODE HERE
    m, n = X.shape
    sum = np.array([0 for i in range(0,n)])
    cost = 0.0
    for i in range(0, m):
        p = indicator(y[i], j) - phi(j, theta, X.loc[i], num_class)
        sum = sum + (X.loc[i] * p)
        cost += my_J(theta, X.loc[i], y[i], j, num_class)
    grad = -sum / m
    
    return grad, cost
In [18]:
Grade cell: cell-0c59178b69fc0e79 Score: 5.0 / 5.0 (Top)
# Test function: Do not remove
m, n = X_train.shape
test_theta = np.ones((3, n))  
grad, cost = my_grad_cost(X_train, y_train, 0, test_theta, num_class)
print(grad)
print(cost)
assert isinstance(cost, float), 'cost should be floating point'
assert isinstance(grad['intercept'], float) and \
        isinstance(grad['SepalLengthCm'], float) and \
        isinstance(grad['SepalWidthCm'], float) and \
        isinstance(grad['PetalLengthCm'], float) and \
        isinstance(grad['PetalWidthCm'], float) , 'grad should be floating point'
print("success!")
# End Test function
intercept       -1.480297e-17
SepalLengthCm    2.952381e-01
SepalWidthCm    -1.200000e-01
PetalLengthCm    7.752381e-01
PetalWidthCm     3.263492e-01
dtype: float64
38.45143010338384
success!

Expected result: (*or similar*)

intercept        0.009524
SepalLengthCm    0.316825
SepalWidthCm    -0.091429
PetalLengthCm    0.780000
PetalWidthCm     0.329524
dtype: float64
37.352817814715735

  3. Implement my_gradient_descent using your my_grad_cost function
In [19]:
Student's answer(Top)
def my_gradient_descent(X, y, theta, alpha, iters, num_class):        
    cost_arr = []
    # YOUR CODE HERE
    n = X.shape[1]
    for iter in range(iters):
        dtheta = np.zeros((num_class, n))
        total_cost = 0.0
        for j in range(0, num_class):
            dtheta[j,:],cost = my_grad_cost(X, y, j, theta, num_class)
            total_cost += cost
        theta = theta - alpha * dtheta
        cost_arr.append(total_cost)
    return theta, cost_arr
In [20]:
Grade cell: cell-8b7f6daed2ecf043 Score: 5.0 / 5.0 (Top)
# Test function: Do not remove
m, n = X_train.shape
test_theta = np.ones((3, n))  
theta, cost = my_gradient_descent(X_train, y_train, theta_initial, 0.001, 5, 3)
print(theta)
print(cost)
print("success!")
# End Test function
[[1.00001532 0.99861875 1.00064608 0.99619293 0.99839151]
 [0.99999803 1.00013986 0.99954657 1.00086174 1.00024112]
 [0.99998665 1.00124139 0.99980735 1.00294533 1.00136737]]
[115.35429031015153, 115.21228800851405, 115.0733801505113, 114.93743744768908, 114.80433627268494]
success!

Expected result: (*or similar*)

[[1.00001186 0.99618853 1.00183642 0.9889817  0.99528923]
 [1.00009697 1.0011823  0.99883395 1.00316763 1.00083055]
 [0.99987915 1.00255606 0.99929351 1.00779768 1.00386218]]
[114.00099216453735, 113.89036233839263, 113.78163144339288, 113.67472269747496, 113.56956268162737]

Exercise 2.2 (20 points)

  1. Plot the training set and test cost as training goes on and find the best value for the number of iterations and learning rate.
  2. Make 2D scatter plots showing the predicted and actual class of each item in the training set, plotting two features at a time. Comment on the cause of the errors you observe. If you obtain perfect test set accuracy, re-run the train/test split and rerun the optimization until you observe some mistaken predictions on the test set.
In [21]:
import matplotlib.pyplot as plt
In [22]:
Student's answer Score: 20.0 / 20.0 (Top)
theta_arr = []
cost_arr = []
accuracy_arr = []

# design your own learning rate and num iterations
alpha_arr = np.array([None, None, None, None])
iterations_arr = np.array([None, None, None, None])

# YOUR CODE HERE
alpha_arr = np.array([0.001, 0.02, 0.07, 0.1])
iterations_arr = np.array([200, 350, 400, 450])

for iterations in iterations_arr:
    for alpha in alpha_arr:
        theta_initial = np.ones((num_class, c))
        theta, cost = my_gradient_descent(X_train, y_train, theta_initial, alpha, iterations, num_class)
        cost_arr.append(cost)
        theta_arr.append(theta)
        
        r,c = X_test.shape
        y_pred = []
        for index,row in X_test.iterrows(): 
            y_hat = h(row, theta, num_class)
            prediction = int(np.argmax(y_hat))
            y_pred.append(prediction)
        
        accuracy = calc_accuracy(y_test, y_pred)
        accuracy_arr.append(accuracy)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-22-47ae4bd79039> in <module>
     14     for alpha in alpha_arr:
     15         theta_initial = np.ones((num_class, c))
---> 16         theta, cost = my_gradient_descent(X_train, y_train, theta_initial, alpha, iterations, num_class)
     17         cost_arr.append(cost)
     18         theta_arr.append(theta)

<ipython-input-19-3a94f1b5f016> in my_gradient_descent(X, y, theta, alpha, iters, num_class)
      7         total_cost = 0.0
      8         for j in range(0, num_class):
----> 9             dtheta[j,:],cost = my_grad_cost(X, y, j, theta, num_class)
     10             total_cost += cost
     11         theta = theta - alpha * dtheta

<ipython-input-17-927661bcf98c> in my_grad_cost(X, y, j, theta, num_class)
      9         p = indicator(y[i], j) - phi(j, theta, X.loc[i], num_class)
     10         sum = sum + (X.loc[i] * p)
---> 11         cost += my_J(theta, X.loc[i], y_train[i], j, num_class)
     12     grad = -sum / m
     13 

<ipython-input-15-d1c130448ab7> in my_J(theta, X, y, j, num_class)
      2     cost = None
      3     # YOUR CODE HERE
----> 4     cost = - indicator(y,j) * math.log(phi(j, theta, X, num_class))
      5     return cost

<ipython-input-3-c8a2c8795861> in phi(i, theta, X, num_class)
     25     den = 0
     26     for j in range(0,num_class):
---> 27         mat_theta_j = np.matrix(theta[j])
     28         den = den + math.exp(np.dot(mat_theta_j, mat_x.T))
     29     phi_i = num / den

~/anaconda3/lib/python3.8/site-packages/numpy/matrixlib/defmatrix.py in __new__(subtype, data, dtype, copy)
    114     __array_priority__ = 10.0
    115     def __new__(subtype, data, dtype=None, copy=True):
--> 116         warnings.warn('the matrix subclass is not the recommended way to '
    117                       'represent matrices or deal with linear algebra (see '
    118                       'https://docs.scipy.org/doc/numpy/user/'

KeyboardInterrupt: 
In [23]:
i = 0

for iterations in iterations_arr:
    plt.grid(axis='both')
    for alpha in alpha_arr:
        x_axis = np.arange(0,iterations)
        cost = cost_arr[i]
        label = "alpha: " + str(alpha) + " acc: " + str(np.round(accuracy_arr[i] * 100,2)) + "%"
        plt.plot(x_axis, cost,label=label)
        i += 1
    title = "For " + str(iterations) + " Iterations"
    plt.title(title)
    plt.xlabel('Iterations')
    plt.ylabel('Cost')
    plt.legend()
    plt.show()
    plt.clf()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-23-b5dc38541544> in <module>
      5     for alpha in alpha_arr:
      6         x_axis = np.arange(0,iterations)
----> 7         cost = cost_arr[i]
      8         label = "alpha: " + str(alpha) + " acc: " + str(np.round(accuracy_arr[i] * 100,2)) + "%"
      9         plt.plot(x_axis, cost,label=label)

IndexError: list index out of range

From the plots, the learning rate of 0.1 with 450 iterations produced the best results; the cost decreased more quickly with this configuration than with the other combinations. However, at one point during training the cost increased and was temporarily worse than the model trained with the lower learning rate of 0.07. Although it soon started decreasing again, the descent was not smooth, and only after a certain number of iterations did this model's cost drop below that of the others and settle into a smooth decline. We can conclude that raising the learning rate beyond this point makes the optimization less stable and can hurt performance for some number of iterations.

Note that the cost of the best-performing model was still decreasing slightly and had not yet converged, so training for more iterations might improve it a little further. One useful technique for this problem is early stopping: halt training once the model stops improving, which also helps avoid overfitting. With early stopping, we can set a large iteration budget and largely eliminate the need to hand-tune the number of iterations.
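The early-stopping idea can be sketched as a small helper that watches a sequence of validation costs and stops after `patience` checks without improvement. The cost sequence here is synthetic; in this lab it would come from evaluating $J(\theta)$ on held-out data each iteration, and the names and thresholds are illustrative:

```python
def early_stop_index(costs, patience=3, min_delta=1e-4):
    """Return the index of the best cost seen before patience runs out."""
    best, best_i, waited = float('inf'), 0, 0
    for i, c in enumerate(costs):
        if c < best - min_delta:          # meaningful improvement
            best, best_i, waited = c, i, 0
        else:                             # no improvement this check
            waited += 1
            if waited >= patience:
                break
    return best_i

costs = [5.0, 4.0, 3.5, 3.4, 3.41, 3.42, 3.43, 3.39]
stop_at = early_stop_index(costs)         # best cost was at index 3
```

Training would keep the parameters saved at `stop_at` rather than the final iterate.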

Expected result: (*Yours doesn't have to be the same!*)

lab4-02.png

Student's task Score: 40.0 / 40.0

On your own to take home

We see that the Iris dataset is pretty easy. Depending on the train/test split, we get 95-100% accuracy.

Find a more interesting multi-class classification problem on Kaggle (and cite the source), clean the dataset to obtain numerical input features without missing values, split the data into train and test sets, and experiment with multinomial logistic regression.

Write a brief report on your experiments and results. As always, turn in a Jupyter notebook by email to the instructor and TA.

Take Home Solution

In [24]:
from IPython.display import Image
Image(filename='data-original.png')
Out[24]:

Reference

The dataset I used for the take-home exercise is the 'Palmer Archipelago (Antarctica) penguin data' from Kaggle. It is a great dataset for practising data exploration and visualization and is often seen as an alternative to the Iris dataset. For my experiment, I built a model that predicts the species of a penguin from the information provided about it. The species in this dataset are Chinstrap, Gentoo, and Adelie.

Note: Two datasets are provided in 'Palmer Archipelago (Antarctica) penguin data'; one is a simplified version of the other. I used the simplified version, since it contains all the important information and omits unnecessary details.

Loading Data

In [25]:
import pandas as pd

def load_data(file_name, drop_label, y_label, is_print=False):
    data = pd.read_csv(file_name)
    if is_print:
        print(data.head())
    if drop_label is not None:
        data = data.drop([drop_label], axis=1)
        if is_print:
            print(data.head())

    # Split into features (all columns after y_label) and target (y_label)
    y_index = data.columns.get_loc(y_label)

    X = data.iloc[:, y_index+1:]
    y = data.iloc[:, y_index]
    
    return X, y
In [26]:
# Loading dataset
X, y = load_data('penguins_size.csv', None, 'species', True)
  species     island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen              39.1             18.7              181.0   
1  Adelie  Torgersen              39.5             17.4              186.0   
2  Adelie  Torgersen              40.3             18.0              195.0   
3  Adelie  Torgersen               NaN              NaN                NaN   
4  Adelie  Torgersen              36.7             19.3              193.0   

   body_mass_g     sex  
0       3750.0    MALE  
1       3800.0  FEMALE  
2       3250.0  FEMALE  
3          NaN     NaN  
4       3450.0  FEMALE  

Handling Missing Values

In [27]:
# Check for missing values in the training and test data

print("Current Missing Data")
print('Missing values for X dataset:\n------------------------\n', X.isnull().sum())
print('Missing values for y dataset \n ------------------------\n', y.isnull().sum())
Current Missing Data
Missing values for X dataset:
------------------------
 island                0
culmen_length_mm      2
culmen_depth_mm       2
flipper_length_mm     2
body_mass_g           2
sex                  10
dtype: int64
Missing values for y dataset 
 ------------------------
 0

Handling Missing Values for the Sex Category

In [28]:
print(X['sex'].value_counts())
MALE      168
FEMALE    165
.           1
Name: sex, dtype: int64
In [29]:
gender = X['sex'].value_counts()
print('Elements in sex variable', gender.shape)
print('MALE ratio ', gender[0]/sum(gender.values))
Elements in sex variable (3,)
MALE ratio  0.5029940119760479
In [30]:
# 10 'sex' values are missing; fill 5 with 'MALE' and 5 with 'FEMALE'
# to roughly preserve the ~50/50 ratio observed above
X['sex'].fillna('MALE', inplace = True, limit = 5)
X['sex'].fillna('FEMALE', inplace = True, limit = 5) 
In [31]:
print('Missing values for X dataset:\n------------------------\n', X.isnull().sum())
Missing values for X dataset:
------------------------
 island               0
culmen_length_mm     2
culmen_depth_mm      2
flipper_length_mm    2
body_mass_g          2
sex                  0
dtype: int64

Handling Missing Values for the culmen_length_mm Category

In [32]:
print(X['culmen_length_mm'].value_counts())
41.1    7
45.2    6
39.6    5
37.8    5
46.5    5
       ..
45.9    1
42.4    1
55.1    1
52.7    1
51.9    1
Name: culmen_length_mm, Length: 164, dtype: int64
In [33]:
print('mean culmen_length_mm ', np.mean(X['culmen_length_mm']))
culmentLength_mm_mean = np.mean(X['culmen_length_mm'])
X['culmen_length_mm'].fillna(culmentLength_mm_mean, inplace=True, limit = 2)
mean culmen_length_mm  43.92192982456142
In [34]:
print('Missing values for X dataset:\n------------------------\n', X.isnull().sum())
Missing values for X dataset:
------------------------
 island               0
culmen_length_mm     0
culmen_depth_mm      2
flipper_length_mm    2
body_mass_g          2
sex                  0
dtype: int64

Handling Missing Values for Other Categories in a Similar Way

In [35]:
print("culmen_depth_mm")
print(X['culmen_depth_mm'].value_counts())
print("------------------------")
print("flipper_length_mm")
print(X['flipper_length_mm'].value_counts())
print("------------------------")
print("body_mass_g")
print(X['body_mass_g'].value_counts())
culmen_depth_mm
17.0    12
18.6    10
15.0    10
18.5    10
17.9    10
        ..
20.8     1
13.4     1
17.4     1
20.6     1
13.3     1
Name: culmen_depth_mm, Length: 80, dtype: int64
------------------------
flipper_length_mm
190.0    22
195.0    17
187.0    16
193.0    15
210.0    14
191.0    13
215.0    12
196.0    10
197.0    10
185.0     9
216.0     8
208.0     8
198.0     8
220.0     8
181.0     7
186.0     7
192.0     7
230.0     7
184.0     7
189.0     7
212.0     7
188.0     6
199.0     6
217.0     6
201.0     6
213.0     6
222.0     6
214.0     6
180.0     5
194.0     5
203.0     5
219.0     5
221.0     5
209.0     5
218.0     5
202.0     4
228.0     4
178.0     4
225.0     4
200.0     4
205.0     3
224.0     3
182.0     3
223.0     2
229.0     2
211.0     2
207.0     2
183.0     2
174.0     1
206.0     1
176.0     1
231.0     1
172.0     1
179.0     1
226.0     1
Name: flipper_length_mm, dtype: int64
------------------------
body_mass_g
3800.0    12
3700.0    11
3950.0    10
3900.0    10
3550.0     9
          ..
4975.0     1
4275.0     1
3100.0     1
3275.0     1
4775.0     1
Name: body_mass_g, Length: 94, dtype: int64
In [36]:
culmentLength_mm_mean = np.mean(X['culmen_length_mm'])
In [37]:
print('mean culmen_depth_mm ', np.mean(X['culmen_depth_mm']))
print('mean flipper_length_mm ', np.mean(X['flipper_length_mm']))
print('mean body_mass_g ', np.mean(X['body_mass_g']))

culmentDepth_mean,flipperLenth_mean,bodyMass_mean = np.mean(X['culmen_depth_mm']),np.mean(X['flipper_length_mm']),np.mean(X['body_mass_g'])
X['culmen_depth_mm'].fillna(culmentDepth_mean, inplace=True, limit = 2)
X['flipper_length_mm'].fillna(flipperLenth_mean, inplace=True, limit = 2)
X['body_mass_g'].fillna(bodyMass_mean, inplace=True, limit = 2)
mean culmen_depth_mm  17.151169590643278
mean flipper_length_mm  200.91520467836258
mean body_mass_g  4201.754385964912
In [38]:
print('Missing values for X dataset:\n------------------------\n', X.isnull().sum())
Missing values for X dataset:
------------------------
 island               0
culmen_length_mm     0
culmen_depth_mm      0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

Converting Categorical Columns into Numeric Entries

In [39]:
X.select_dtypes(['object']).columns
Out[39]:
Index(['island', 'sex'], dtype='object')

Exploring 'sex' column

In [40]:
print(X['sex'].value_counts())
MALE      173
FEMALE    170
.           1
Name: sex, dtype: int64
In [41]:
# category codes are '.'=0, FEMALE=1, MALE=2; subtracting 1 maps
# FEMALE to 0 and MALE to 1 (the single '.' entry becomes -1)
X['sex'] = X['sex'].astype("category").cat.codes - 1
In [42]:
X.select_dtypes(['object']).columns
Out[42]:
Index(['island'], dtype='object')

Exploring 'island' column

In [43]:
print(X['island'].value_counts())
Biscoe       168
Dream        124
Torgersen     52
Name: island, dtype: int64
In [44]:
X['island'] = X['island'].astype("category").cat.codes
In [45]:
X.select_dtypes(['object']).columns
Out[45]:
Index([], dtype='object')
In [46]:
X.head()
Out[46]:
island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
0 2 39.10000 18.70000 181.000000 3750.000000 1
1 2 39.50000 17.40000 186.000000 3800.000000 0
2 2 40.30000 18.00000 195.000000 3250.000000 0
3 2 43.92193 17.15117 200.915205 4201.754386 1
4 2 36.70000 19.30000 193.000000 3450.000000 0

Normalizing Data

The columns 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', and 'body_mass_g' were standardized (zero mean, unit variance), since taking the exponential of the dot product of the raw training data and theta overflows to 'inf' due to their large values.
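Standardizing keeps the logits small, but softmax overflow can also be guarded against directly. A minimal sketch of the standard trick, subtracting the maximum logit before exponentiating; this is a generic illustration, not the lab's `phi` implementation:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtracting max(z) leaves the result
    mathematically unchanged but keeps np.exp from overflowing."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))   # largest exponent is now 0
    return e / e.sum()

# Without the shift, np.exp(1000.0) overflows to inf; with it, the
# probabilities are finite and sum to 1.
probs = softmax([1000.0, 1001.0, 1002.0])
```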

In [47]:
means = np.mean(X[['culmen_length_mm','culmen_depth_mm','flipper_length_mm','body_mass_g']], axis=0)
stds = np.std(X[['culmen_length_mm','culmen_depth_mm','flipper_length_mm','body_mass_g']], axis=0)
X_norm = X.copy()
X_norm[['culmen_length_mm','culmen_depth_mm','flipper_length_mm','body_mass_g']] = (X[['culmen_length_mm','culmen_depth_mm','flipper_length_mm','body_mass_g']] - means) / stds
In [48]:
X_norm.head()
Out[48]:
island culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g sex
0 2 -8.870812e-01 7.877425e-01 -1.422488 -0.565789 1
1 2 -8.134940e-01 1.265563e-01 -1.065352 -0.503168 0
2 2 -6.663195e-01 4.317192e-01 -0.422507 -1.192003 0
3 2 1.307172e-15 1.806927e-15 0.000000 0.000000 1
4 2 -1.328605e+00 1.092905e+00 -0.565361 -0.941517 0

Splitting Data into Train and Test Dataset

In [49]:
percent_train = 0.7
idx, X_train, y_train, X_test, y_test, y_labels_name, y_labels_new = partition(X_norm, y, percent_train)
print('X_train.shape', X_train.shape)
print('X_test.shape', X_test.shape)
print('y_train.shape', y_train.shape)
print('y_test.shape', y_test.shape)
print('y_labels_name: ', y_labels_name)
print('y_labels_new: ', y_labels_new)
print(X_train.head())
print(y_train.head())
X_train.shape (240, 6)
X_test.shape (104, 6)
y_train.shape (240,)
y_test.shape (104,)
y_labels_name:  ['Adelie' 'Chinstrap' 'Gentoo']
y_labels_new:  [0 1 2]
   island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  body_mass_g  \
0       1          1.265345         0.940324           0.648902    -0.127440   
1       1         -0.132812        -0.280327          -0.993924    -1.630352   
2       2         -1.328605         0.838603          -0.993924    -0.503168   
3       0          0.879012        -0.738072           0.506047     1.438093   
4       0          1.265345        -0.738072           1.791737     1.250229   

   sex  
0    1  
1    0  
2    0  
3    1  
4    1  
0    1
1    1
2    0
3    2
4    2
Name: species, dtype: object

Training the Model

In [50]:
num_class = len(y_labels_name)

if (X_train.shape[1] == X.shape[1]): 
    X_train.insert(0, "intercept", 1)
if (X_test.shape[1] == X.shape[1]): 
    X_test.insert(0, "intercept", 1)
    
r,c = X_test.shape

theta_arr = []
cost_arr = []
accuracy_arr = []

alpha_arr = np.array([0.001, 0.02, 0.07, 0.1])
iterations = 450

for alpha in alpha_arr:
    theta_initial = np.ones((num_class, c))
    theta, cost = my_gradient_descent(X_train, y_train, theta_initial, alpha, iterations, num_class)
    cost_arr.append(cost)
    theta_arr.append(theta)

    r,c = X_test.shape
    y_pred = []
    for index,row in X_test.iterrows(): 
        y_hat = h(row, theta, num_class)
        prediction = int(np.argmax(y_hat))
        y_pred.append(prediction)

    accuracy = calc_accuracy(y_test, y_pred)
    accuracy_arr.append(accuracy)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-50-d8ba88f362ba> in <module>
     17 for alpha in alpha_arr:
     18     theta_initial = np.ones((num_class, c))
---> 19     theta, cost = my_gradient_descent(X_train, y_train, theta_initial, alpha, iterations, num_class)
     20     cost_arr.append(cost)
     21     theta_arr.append(theta)

<ipython-input-19-3a94f1b5f016> in my_gradient_descent(X, y, theta, alpha, iters, num_class)
      7         total_cost = 0.0
      8         for j in range(0, num_class):
----> 9             dtheta[j,:],cost = my_grad_cost(X, y, j, theta, num_class)
     10             total_cost += cost
     11         theta = theta - alpha * dtheta

<ipython-input-17-927661bcf98c> in my_grad_cost(X, y, j, theta, num_class)
      7     cost = 0.0
      8     for i in range(0, m):
----> 9         p = indicator(y[i], j) - phi(j, theta, X.loc[i], num_class)
     10         sum = sum + (X.loc[i] * p)
     11         cost += my_J(theta, X.loc[i], y_train[i], j, num_class)

KeyboardInterrupt: 

Plot

In [51]:
i = 0

plt.grid(axis='both')
for alpha in alpha_arr:
    x_axis = np.arange(0,iterations)
    cost = cost_arr[i]
    label = "alpha: " + str(alpha) + " acc: " + str(np.round(accuracy_arr[i] * 100,2)) + "%"
    plt.plot(x_axis, cost,label=label)
    i += 1
title = "Iterations vs. Cost Plot"
plt.title(title)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.legend()
plt.show()
plt.clf()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-51-08258596aec2> in <module>
      4 for alpha in alpha_arr:
      5     x_axis = np.arange(0,iterations)
----> 6     cost = cost_arr[i]
      7     label = "alpha: " + str(alpha) + " acc: " + str(np.round(accuracy_arr[i] * 100,2)) + "%"
      8     plt.plot(x_axis, cost,label=label)

IndexError: list index out of range

Summary

1) Data from the 'Palmer Archipelago (Antarctica) penguin data' dataset was loaded

2) Missing Values in the data were filled in

3) Data for columns 'culmen_length_mm','culmen_depth_mm','flipper_length_mm', and 'body_mass_g' were normalized

4) Categorical columns were converted into numeric entries

5) Data was split into train and test datasets

6) Model was trained with 4 different learning rates

Conclusion and Remarks

From the iteration vs. cost plot, it is observed that increasing the learning rate allows the model to learn from the data more quickly over the first iterations and hence approach convergence sooner. However, as training progressed, the model trained with the lower learning rate seemed to be catching up with the model trained with the higher learning rate. From this observation, one way to train the model more efficiently is to begin with a relatively large learning rate and then decrease it gradually as training progresses, increasing the probability that the model converges to a minimum of the cost function. Early stopping, discussed briefly in exercise 2.2, can also be added to further improve training, so that a larger number of iterations can be used safely.
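The "start large, then shrink" idea described above is a learning-rate schedule. A minimal sketch using exponential decay; `alpha0` and `decay` are illustrative values, not tuned for this dataset:

```python
import numpy as np

def decayed_alpha(alpha0, decay, t):
    """Exponentially decayed learning rate at iteration t:
    starts at alpha0 and shrinks by a factor exp(-decay) per step."""
    return alpha0 * np.exp(-decay * t)

# Schedule over the 450 iterations used in the experiment above:
# begins at 0.1 and decays toward ~0.001 by the final iteration.
schedule = [decayed_alpha(0.1, 0.01, t) for t in range(450)]
```

Inside a gradient-descent loop, the fixed `alpha` would simply be replaced by `decayed_alpha(alpha0, decay, iteration)`.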